library(tidyverse) # for graphing and data cleaning
library(gardenR) # for Lisa's garden data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
theme_set(theme_minimal()) # My favorite ggplot() theme :)
# Lisa's garden data
data("garden_harvest")
# Seeds/plants (and other garden supply) costs
data("garden_spending")
# Planting dates and locations
data("garden_planting")
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to R Studio. To do that, you should read the instruction (through the “Cloning a repo” section) and watch the video here. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
keep_md: TRUE in the YAML heading. The .md file is a markdown (NOT R Markdown) file that is an interim step to creating the html file. They are displayed fairly nicely in GitHub, so we want to keep it and look at it there. Click the boxes next to these two files, commit changes (remember to include a commit message), and push them (green up arrow).Put your name at the top of the document.
For ALL graphs, you should include appropriate labels.
Feel free to change the default theme, which I currently have set to theme_minimal().
Use good coding practice. Read the short sections on good code with pipes and ggplot2. This is part of your grade!
When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don’t do it before then, or else you might miss some important warnings and messages.
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week (HINT: use the wday() function from lubridate). Display the results so that the vegetables are rows but the days of the week are columns.garden_harvest %>%
mutate(day = wday(date, label = TRUE)) %>%
group_by(vegetable, day) %>%
summarise(total_weight = sum(weight)) %>%
pivot_wider(id_cols = vegetable:total_weight,
names_from = "day",
values_from = "total_weight")
garden_harvest data to find the total harvest in pound for each vegetable variety and then try adding the plot from the garden_planting table. This will not turn out perfectly. What is the problem? How might you fix it?garden_harvest %>%
group_by(variety) %>%
summarise(total_weight_lbs = sum(weight * 0.00220462)) %>%
left_join(garden_planting,
by = "variety") %>%
select(variety:plot)
Some of the vegetable varieties show up twice, with the same total weight. If each variety was weighed and recorded with their plot this could help to solve this issue. We could also average the weight out proportionally based on the number of seeds planted in each plot for each variety.
garden_harvest and garden_spending datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.Using the garden harvest dataset as the record of vegetables harvested and their weights, we could join this with the garden spending dataset by vegetable and variety to find the seed price for each variety. The price of the top soil and dirt would be lost in this join process but this could be added later as a constant cost for each vegetable. With the Whole Foods dataset we could perform another join function, this time just by vegetable. Now we would have a dataset with the weight of each vegetable, the seed price and the price of it in a store. We could then group this data by vegetable and create two new variables, the total seed price and the total cost of it at the store (adjusted for weight). With these new variables we could create a third variable for each vegetable, money saved, which is the seed price - store price.
garden_harvest %>%
filter(vegetable == "tomatoes") %>%
group_by(variety) %>%
summarise(total_weight_lbs = sum(weight * 0.00220462),
first_harvest_date = min(date)) %>%
arrange(first_harvest_date) %>%
ggplot(mapping = aes(x = total_weight_lbs, y = fct_reorder(variety, first_harvest_date, .desc = TRUE))) +
geom_col() +
labs(title = "Tomato harvest weights by variety, ordered by first harvest date",
x = "",
y = "")
garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().garden_harvest %>%
mutate(lower_variety = str_to_lower(variety),
length_variety = str_length(lower_variety)) %>%
distinct(lower_variety, .keep_all = TRUE) %>%
arrange(vegetable, length_variety)
garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().garden_harvest %>%
distinct(variety, .keep_all = TRUE) %>%
group_by(vegetable) %>%
filter(str_detect(variety, "er|ar"))
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
{300px}
{300px}
Two data tables are available:
Trips contains records of individual rentalsStations gives the locations of the bike rental stationsHere is the code to read in the data. We do this a little differently than usually, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations<-read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
sdate. Use geom_density().options(scipen = 10)
Trips %>%
ggplot(mapping = aes(x = sdate)) +
geom_density() +
labs(title = "Number of bike rentals by date",
x = "Date",
y = "Density")
As the months get colder, the number of rentals decrease. There is an increase in the early part of December before decreasing again in January.
mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.options(scipen = 10)
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time_day = hour + minute/60) %>%
ggplot(mapping = aes(x = time_day)) +
geom_density() +
xlim(0, 24)+
labs(title = "Number of bike rentals by time of day",
x = "Time of Day",
y = "Density")
There are two rental peaks, one around 9:00AM and another around 5:30PM. This could mean that people are using the bikes to commute. In the middle of the day there is a lull, but bike usage is still higher than other parts of the day, such as the early morning or late evening.
Trips %>%
mutate(day = wday(sdate, label = TRUE)) %>%
ggplot(mapping = aes(y = day)) +
geom_bar() +
scale_y_discrete(limits = c("Sat", "Fri", "Thu",
"Wed", "Tue", "Mon", "Sun")) +
labs(title = "Bike rentals by day of the week",
x = "",
y= "")
This plot adds further evidence from my previous hypothesis that people are potentially using the bikes to commute. This is because during the week days, there are more bicycle rentals than during the weekends.
options(scipen = 10)
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time_day = hour + minute/60,
day = wday(sdate, label = TRUE)) %>%
ggplot(mapping = aes(x = time_day)) +
geom_density() +
xlim(0, 24)+
labs(title = "Number of bike rentals by time of day, separated by day of the week",
x = "Time of Day",
y = "Density") +
facet_wrap(vars(day))
There is a noticeable pattern. On weekdays, people are using the bikes to commute as evidenced by the two peaks, one at the start of the work day and another at the end. On the weekends, people are riding the bikes in the early to late afternoon.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Causal). The next set of exercises investigate whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises.
fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.options(scipen = 10)
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time_day = hour + minute/60,
day = wday(sdate, label = TRUE)) %>%
ggplot(mapping = aes(x = time_day, fill = client)) +
geom_density(color = NA, alpha = .5) +
xlim(0, 24)+
labs(title = "Number of bike rentals by time of day, separated by day of the week",
x = "Time of Day",
y = "Density",
fill = "Client Type") +
facet_wrap(vars(day))
It appears as though registered riders are more likely to use the bikes for their commute, at the beginning or end of the work day. Casual riders are more likely to use the bikes in the middle of the day and during weekends.
position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?options(scipen = 10)
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time_day = hour + minute/60,
day = wday(sdate, label = TRUE)) %>%
ggplot(mapping = aes(x = time_day, fill = client)) +
geom_density(color = NA, alpha = .5, position = position_stack()) +
xlim(0, 24)+
labs(title = "Number of bike rentals by time of day, separated by day of the week",
x = "Time of Day",
y = "Density",
fill = "Client Type") +
facet_wrap(vars(day))
In my opinion, this is worse for telling a story. It makes it harder to see where each type of rider is using the bikes. At best it tells us which are the most common times for bike rentals overall. In terms of the story I was trying to tell with commuters, this version of the plot doesn’t help. It makes it appear that casual and registered riders use the bikes at similar times.
position = position_stack()). Add a new variable to the dataset called weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.options(scipen = 10)
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time_day = hour + minute/60,
day = wday(sdate, label = TRUE),
weekend = ifelse(day == "Sat" | day == "Sun", "Weekend", "Weekday")) %>%
ggplot(mapping = aes(x = time_day, fill = client)) +
geom_density(color = NA, alpha = .5) +
xlim(0, 24)+
labs(title = "Number of bike rentals by time of day, separated by weekday or weekend",
x = "Time of Day",
y = "Density",
fill = "Client Type") +
facet_wrap(vars(weekend))
Similar to the other plots, this shows that there is a clear difference in when the types of riders use the bikes, casual riders during the middle part of the day and registered riders during their commute. Interestingly, there are a group of registered riders that are using the bikes in the early hours of the weekend mornings.
client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?options(scipen = 10)
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time_day = hour + minute/60,
day = wday(sdate, label = TRUE),
weekend = ifelse(day == "Sat" | day == "Sun", "Weekend", "Weekday")) %>%
ggplot(mapping = aes(x = time_day, fill = weekend)) +
geom_density(color = NA, alpha = .5) +
xlim(0, 24)+
labs(title = "Number of bike rentals by time of day, separated by client type",
x = "Time of Day",
y = "Density",
fill = "") +
facet_wrap(vars(client))
It appears that this tells us a very similar story to the previous graphs. It makes it easier to compare trends within each type of client. Casual riders use the bikes at similar times but more frequently on the weekends. Registered riders use the bikes at completely different times on the weekends, usually during the middle of the day. I think these two graphs are both useful, it just depends what story you are trying to tell.
Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!Trips_Station <- Trips
names(Trips_Station)[names(Trips_Station) == "sstation"] <- "name"
Trips_Station %>%
group_by(name) %>%
summarise(num_departures = n()) %>%
left_join(Stations, by = "name") %>%
ggplot(mapping = aes(x = lat, y = long, color = num_departures)) +
geom_point() +
labs(title = "Number of departures from each station",
x = "",
y = "",
color = "Number of Departures") +
scale_color_viridis_c()
Trips_Station %>%
group_by(name, client) %>%
summarise(num_departures = n()) %>%
mutate(total_departures = sum(num_departures),
percent_casual = ifelse(client == "Casual", num_departures/total_departures, 0)) %>%
filter(percent_casual != 0) %>%
left_join(Stations, by = "name") %>%
ggplot(mapping = aes(x = lat, y = long, color = percent_casual)) +
geom_point() +
labs(title = "Number of departures from each station",
x = "",
y = "",
color = "Number of Departures") +
scale_color_viridis_c()
The casual users have higher percentages of bicycle usage in the stations that are further out. They have lower percentages in the stations that are clustered together in the upper left of the plot.
as_date(sdate) converts sdate from date-time format to date format.top_ten_station_data <- Trips %>%
mutate(date = as_date(sdate)) %>%
group_by(sstation, date) %>%
summarise(num_departures = n()) %>%
arrange(desc(num_departures)) %>%
head(10)
top_ten_station_data
Trips %>%
mutate(date = as_date(sdate)) %>%
semi_join(top_ten_station_data,
by = c("sstation", "date"))
Trips %>%
mutate(date = as_date(sdate),
dow = wday(date, label = TRUE)) %>%
semi_join(top_ten_station_data,
by = c("sstation", "date")) %>%
group_by(client, dow) %>%
summarise(num_departures = n()) %>%
mutate(total_departures = sum(num_departures)) %>%
group_by(client, dow) %>%
summarise(prop = num_departures/total_departures) %>%
pivot_wider(id_cols = client:prop,
names_from = "client",
values_from = "prop")
Of the top 10 days and station combinations, none of these days are Fridays. The vast majority of the casual rides happen on the weekend, with over half coming on Saturday. The opposite is true of the registered riders, with the majority coming during the week. A large majority of the registered riders rides come on Thursday. This fits in with the story told by the rest of the plots, registered riders are using the bikes to commute and casual riders use them to get around on the weekends.
DID YOU REMEMBER TO GO BACK AND CHANGE THIS SET OF EXERCISES TO THE LARGER DATASET? IF NOT, DO THAT NOW.
https://github.com/abmhopkins/STAT112_Exercise_03/blob/main/03_exercises.md
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the moodle page to see it.kids %>%
filter(variable == "lib",
year == 1997 | year == 2016) %>%
select(state, year, inf_adj_perchild) %>%
pivot_wider(id_cols = state:inf_adj_perchild,
names_from = "year",
values_from = "inf_adj_perchild") %>%
mutate(increase = ifelse(`2016`>`1997`, TRUE, FALSE)) %>%
ggplot() +
geom_point(aes(x = 0, y = `1997`*1000, color = ifelse(increase == TRUE, "white", "black")), size = 1.5) +
geom_segment(aes(x = 0,
xend = 1,
y = `1997`*1000,
yend = `2016`*1000,
color = ifelse(increase == TRUE, "white", "black")),
arrow = arrow(length = unit(0.075, "inches")),
size = 0.8) +
geom_text(aes(x = 0,
y = `1997`-200,
label = round(`1997`*1000)),
hjust = 0.5,
size = 3,
color = "grey") +
geom_text(aes(x = 1.05,
y = `2016`-200,
label = round(`2016`*1000)),
hjust = 0.5,
size = 3,
color = "grey") +
scale_color_identity()+
facet_geo(vars(state), label ="code") +
theme(plot.background = element_rect(fill = "#6C8299"),
panel.grid = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
plot.margin = margin(5,20,5,20),
panel.spacing.y = unit(0.75, "lines"),
panel.spacing.x = unit(1.25 , "lines"),
strip.text = element_text(color = "white"),
legend.position = "none") +
coord_cartesian(clip = "off", expand = FALSE) +
labs(title = "Change in public spending on libraries from 1997 to 2016",
subtitle = "Dollars spent per child, adjusted for inflation",
x = "",
y = "",
caption = "Source: Urban Institute | Inspired by: Georgios Karamanis | Graphic: Alexander Hopkins")
DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?